-
Notifications
You must be signed in to change notification settings - Fork 354
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add Pandas DataFrame support to TabularDataset #1185
Conversation
df_source (pd.DataFrame): | ||
Required. Pandas DataFrame containing the source data for | ||
ingestion as a TabularDataset. | ||
staging_path (str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems the location requirements should also be documented or a reference to this documentation should be provided: https://cloud.google.com/vertex-ai/docs/general/locations#bq-locations
If possible they should be validated but not a hard requirement. Is it possible for the dataset create to fail because of the regional requirements?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I just tested it and it can fail if the dataset location doesn't match the project location or the service doesn't have the right access to the dataset. I'll update the docstring to link to that page.
In terms of validating, the BQ client throws this error: google.api_core.exceptions.FailedPrecondition: 400 BigQuery Dataset location eu
must be in the same location as the service location us-central1
.
Do you think we should validate as well or let the BQ client handle validation? If we do validation, we'd need to use the BQ client to check the location of the provided BQ dataset string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with relying on BQ client.
parquet_options = bigquery.format_options.ParquetOptions() | ||
parquet_options.enable_list_inference = True | ||
|
||
job_config = bigquery.LoadJobConfig( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this config infer all the types? I see the enable_list_inference but I couldn't find a reference in the BQ docs for non list type inference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this will infer the data types from the DF. From the BQ docs:
The destination table to use for loading the data. If it is an
existing table, the schema of the :class:`~pandas.DataFrame`
must match the schema of the destination table. If the table
does not yet exist, the schema is inferred from the
:class:`~pandas.DataFrame`.
I added a bq_schema
param to give the user the option to override the data type autodection, but I think I may need to add more client-side validation on that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can rely on BQ client validation.
@@ -147,6 +148,30 @@ | |||
|
|||
_TEST_LABELS = {"my_key": "my_value"} | |||
|
|||
# create_from_dataframe |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preference for an integration test as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added 2 integration tests: one with the bq_schema
param and one without.
df_source (pd.DataFrame): | ||
Required. Pandas DataFrame containing the source data for | ||
ingestion as a TabularDataset. | ||
staging_path (str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with relying on BQ client.
@@ -19,12 +19,19 @@ | |||
|
|||
from google.auth import credentials as auth_credentials | |||
|
|||
from google.cloud import bigquery | |||
from google.cloud.bigquery import _pandas_helpers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't rely on a private module from a dependent python package because it may break between minor versions.
Required. The user-provided BigQuery schema. | ||
""" | ||
try: | ||
_pandas_helpers.dataframe_to_arrow(dataframe, schema) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the BQ client handles this validation we can rely on the client and avoid importing this private module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this validation check and letting the BQ client handle it.
parquet_options = bigquery.format_options.ParquetOptions() | ||
parquet_options.enable_list_inference = True | ||
|
||
job_config = bigquery.LoadJobConfig( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can rely on BQ client validation.
dataset.location = _TEST_LOCATION | ||
shared_state["bigquery_dataset"] = bigquery_client.create_dataset(dataset) | ||
|
||
yield |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like BQ dataset deletion should be handled after the yield.
|
||
aiplatform.init(project=_TEST_PROJECT, location=_TEST_LOCATION) | ||
|
||
tabular_dataset = aiplatform.TabularDataset.create_from_dataframe( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tabular_dataset should be appended to shared_state['resources']
so it's deleted after the test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated both new tests to use shared_state['resources']
bigquery.SchemaField(name="string_array_col", field_type="STRING", mode="REPEATED"), | ||
bigquery.SchemaField(name="bytes_col", field_type="STRING"), | ||
] | ||
|
||
|
||
class TestDataset: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should enherit from the base e2e class so it can take advantage of the resource cleanup functionality: https://github.com/googleapis/python-aiplatform/blob/main/tests/system/aiplatform/e2e_base.py#L37
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to inherit from e2e_base
. I noticed there was some duplication of other fixtures in this file so I removed that and inherited the fixtures from the base class.
bq_schema (Optional[Union[str, bigquery.SchemaField]]): | ||
Optional. The schema to use when creating the staging table in BigQuery. For more details, | ||
see: https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.job.LoadJobConfig#google_cloud_bigquery_job_LoadJobConfig_schema | ||
This is not needed if the BigQuery table provided in `staging_path` already exists. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this imply when the staging table does not exit, user need to pass in BQ schema? No existing staging table is the most common case, and this would be a bit hard to use - users need to learn how to create BQ schema, etc.
Given df_source already has a schema, can we auto generate the BQ schema for users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The best case is BQ client support inferencing schema natively, could you check on that? BQ UI supports inferencing schema for JSONL. Haven't tried Parquet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The BQ client will autodetect the schema using the DataFrame column types if the user doesn't provide this bq_schema
parameter, this happens here. So I'm guessing most users won't use this schema param to override the autodetect.
I think my docstring was confusing. I updated it, let me know if this is clearer:
Optional. If not set, BigQuery will autodetect the schema using your DataFrame's column types.
If set, BigQuery will use the schema you provide when creating the staging table. For more details,
see: https://cloud.google.com/python/docs/reference/bigquery/latest/google.cloud.bigquery.job.LoadJobConfig#google_cloud_bigquery_job_LoadJobConfig_schema
Adds support for uploading a local Pandas DataFrame as a Vertex TabularDataset via a
create_from_dataframe
method on theTabularDataset
class.Also added relevant tests to
test_datasets.py
Fixes b/189369695 🦕